OPERATING PROCEDURE R: SUNBURST PLOT

Introduction

Sunburst #This is an exemple of a sunburst plot 

Parasite of ruminants cause diseases of major socio-economic importance worldwide.The Gastrointestinal Nematodes (GIN), are currently the most important disease problem of sheep worldwide causing reductionin meat, milk, and fibre production. GIN are rapidly becoming more and more resistant to all the different anthelmintic drug treatments, which is why analysing parasitic GIN community structure is essential for diagnostics, surveillance and research.

This is why the Nemabiome deep amplicon sequencing targeting ITS-2 (Internal Transcribed Spacer 2) region of nematode nuclear ribosomal DNA has been developed. This next generation sequencing provides a picture of the species composition of the GIN parasite community structure in small and large sample sets. The Nemabiome sequencing pipeline gives us the results condensed into a twelves sheet excel file which indicates IdTaxa (60 and 30) results summed up by species or by ASV.

IdTaxa is a taxonomic classification tool to classify rRNA or ITS sequences. ASV or Amplicon sequence variant refers to individual DNA sequences recovered using a high-throughput marker gene analysis when false sequences are removed after sequencing. The excel file also contains the results obtained from BLAST. BLAST or Basic Local Alignment Search Tool is a program which finds regions of local similarity between sequences by comparing either nucleotide or protein sequences to sequence databases in order to calculate statistical significance. Unfortunately, the results obtained in those different excel sheet are not simple to read and to analyse and the most compelling information’s might be lost without proper data visualization. Using the sequencing data output, it would be great to be able to understand the different classification algorithms used BLAST and IdTaxa and highlight various piece of information’s emerging from these classifications. This walkthrough will be using interactive plots as an additional way to highlight any classification discrepancies which are hard to visualize by only looking at the excel spreadsheets.

Sunburst plot (exemple above) is a tool to enable visualisation of hierarchal data spanning outward radially from root to leaves. This tutorial will use this tool to visualize the Taxonomic classification of the gastrointestinal nematodes found in our samples, using the Nemabiome sequencing data.

The Walkthrough is using as an example a dataset of 10 deer samples

Walkthrough

Files Needed

Ensure you know where all the following files are located on your computer (file directories):

You will need the FinalSummary.xlsx excel file obtained after running the Nemabiome pipeline. The “IdTaxa60Classification” tab will be specifically used.


Packages

The first step is to install the following packages. These packages are available in the Comprehensive R Archive Network (CRAN) repository and can be installed via install.packages(). By default, the latest version available will be installed.

  1. dplyr package: Commonly use to modify data frame like objects, both in memory and out of memory by providing set of verbs such as select(), summarize() or group_by().

  2. ggplot2 package: Creates graphics using “The Grammar of Graphics”. This package takes into consideration the variables mapped to its aesthetics function as well as the graphical function chosen and creates a graph.

  3. openxlsx package: Provides an easier way to interact with Excel.xlsx files. Allows for the creation, styling and editing of excel spreadsheets

  4. plotly package: Turns the graphics designed using ggplot2 into interactive web graphics.

  5. reshape2 package: Restructure and aggregate data with function such as melt()

  6. tidyr package: Tidy data sets using functions such as gather()

  7. stringr package: Treats with common string operations. This package contains a cohesive set of functions designed to work with strings in a simple manner

##remove the # to install, leave the # if already installed 

#install.packages("dplyr")
#install.packages("ggplot2")
#install.packages("openxlsx")
#install.packages("plotly")
#install.packages("reshape2")
#install.packages("tidyr")
#install.packages("stringr")

With the packages now installed, you can load them into R (or Rstudio) via the library() function.

## Load packages into R (Remove # to load)

#library(openxlsx)
#library(reshape2)
#library(ggplot2)
#library(plotly)
#library(dplyr)
#library(tidyr)
#library(stringr)

Setting Path to Samples

Here, you are telling R where your samples are located. This directory will need to be changed depending on where your samples are on the server/locally. The Variables to be Changed tab contains all directory changes in a more concise manner.

Setting Path to Samples: R needs to know where your samples are

My_Data <- read.xlsx("/Users/eleonorecharrier/Dropbox (Gilleard Lab)/Eleonores work/Coding_projects/Eleonore/Summary.xlsx", sheet = "IdTaxa60Classification")  

Colors

This step allows to format species color into an object which can be reuse later on for graph aesthetic purposes.

https://www.rapidtables.com/web/color/RGB_Color.html

This link will provide easy hexadecimal colors code. You can pick your color and format species color. If you need to add new species, you can add another line as such: “genus_species” = “#hexadecimalcode”

## This first colour object will be later on used in the aesthetic for the stackbar plot

manualcolorSb <- c("NA" = 'black',
                    "Strongylida"= '#800000',
                    "Haemonchidae" = '#b30000',
                    "Chabertiidae" = '#ff6600',
                    "Molineidae" = '#ffcc00',
                    "Spiculopteragia" = '#ff1a1a',
                    "Ostertagia" = '#ff4d4d',
                    "Haemonchus" = '#ff8080',
                    "Oesophagostomum" = '#ff944d',
                    "Nematodirus" = '#ffe680',
                    "Spiculopteragia_boehmi" = '#ff8566', 
                    "unclassified_Spiculopteragia" = '#ff3333',
                    "unclassified_Ostertagia" = '#ffcccc',
                    "Haemonchus_placei" = '#ffe6e6',
                    "unclassified_Haemonchus" = '#ffccff',
                    "Oesophagostomum_venulosum" = '#ffc299',
                    "Nematodirus_belvetianus" = '#fff5cc') 

Functions

This step will allow the creation of custom functions which will be later of use for the construction of sunburst plot. These custom functions are created as a way to simplify the code as well as a way to make it more understandable. Those custom functions will also allow the user to have a code frame where only their data input will be changed. This will make the code more stable and reliable.

The first custom function will render possible the gathering of column of interest from the mother file of choice. This function will allow the user to replace both of those by: attributing the column of interest to “To_Replace” and the mother dataframe of choice to “Data_Of_Choice”. This function use the gather() function to collect a set of wanted column names and places them into a single “key” column. This function is used twice. A first time, to gather the column samples of choice into a single column called “Taxonomy_Value”, and a second time to gather the number of reads per “Taxonomy_Value”. The group_by() function is then used to group our two columns by the “Taxonomy_Value” column. Finally, the summarise() on the grouped data previously created by the group_by() function. This function applied to the number of reads per species will create one row for each “Taxonomy_Value”. If ordered into a new object, this function will create a new data frame containing only the number of reads per taxonomy_value (species).

Function_Gather <-function(To_Replace, Data_Of_Choice) {
 To_Replace <- enquo(To_Replace)
   Data_Of_Choice %>% 
  gather(key = "Sample", value = "Read", !!To_Replace) %>%
    gather(key = "taxonomy", value = "Taxonomy_Value", order:species) %>%
      group_by(Taxonomy_Value) %>%
        summarise(Total = sum(Read))
}

The second custom function will arrange a data frame to be assigned to “Name_Arrange” in a custom order. The arrange() and match() functions are used in this custom function to reorder the rows by one or more variable from the wanted column “Column_Of_Interest” which will be attributed to the object “Name_Taxonomy_Replace”. The custom order of the rows is the order in which your classification is in the “Labels” column of the “Sunburst Table” tab of the walkthrough.

Function_Arrange <- function(Name_Arrange, Column_Of_Interest ,Name_Taxonomy_Replace){
Name_Arrange %>% 
  arrange(match(Column_Of_Interest, c(Name_Taxonomy_Replace)))
} 

The third custom function has for aim to create a data frame which can be use to create a Sunburt plot. In order to do so, the data.frame() function is used to create a dataframe with the different parameters needed for the construction of Sunburst plot.

The sunburst sectors are determined by the entries in “ids”, “labels” and “parents”. - ids: This will assign id labels to each datum. These ids will be used as an object constancy of the different data points during the sunburst animation. They should only be an array of strings. - labels: This will enable us to set the labels for each of the different sectors. See “Labels” column of the “Sunburst Table” tab of the walkthrough. - parents: This will set the parent sectors for each of the different created sectors. The use of empty string items ’’ will be understood as a way to reference the root node in the hierarchy. In this case, as the “ids” are filled, the “parents” items are understood to also be “ids” themselves. See “Parents” column of the “Sunburst Table” tab of the walkthrough. - size: This will attribute the size of each of the different sectors.

Function_DataFrame_Sb <- function(Data_Arrange_Taxonomy, Label_Order, Parent_Order, Data_Arrange_Total){
data.frame(
    ids = c(Data_Arrange_Taxonomy) ,
  labels = c(Label_Order),
  parents = c(Parent_Order),
  size = Data_Arrange_Total,
   stringsAsFactors = FALSE) 
}

The fourth and final custom function will be used to create and allow the Sunburst plot to be interactive. In order to do so, the plot_ly() function is used. The hovertext is attributed to the selected data to appear when hovering the different sectors of the sunburst plot.

#This function will allow for the creation of the Sunburst Plot
Function_Sb <- function(Sunburst_Dataset, Dataset_Arrange_Total){
plot_ly(Sunburst_Dataset, 
        ids = ~ids, 
        labels = ~labels, 
        parents = ~parents, 
        type = 'sunburst',
        hovertext = ~Dataset_Arrange_Total,
        marker = list(colors = manualcolorSb))
}

Sunburst Table

This table is an example of how to create the table to understand the order of the sunburst more easily. This table is based on the 10 deer samples example. This table will help when allocating the parents and labels into the sunburst plot. This table will vary with each set of data.

Step 1

Step 2

Final Table:

Sunburst Classification Table.


Sunburst Plot

Creating this sunburst will allow for a better understanding of the taxonomic relationship between the different gastrointestinal species population found within the samples.

First, the read.xlsx() function is used to read specifically the IdTaxa60Classification sheet in the sequencing excel spreadsheet results. The IdTaxa 60 classification results is used as it represents a confidence threshold of 60% which is higher than the IdTaxa 30 classification results, thus more robust.Its structure is looked at to understand and look if changes need to be made.

My_Data <- read.xlsx("/Users/eleonorecharrier/Dropbox (Gilleard Lab)/Eleonores work/Coding_projects/Eleonore/Summary.xlsx", sheet = "IdTaxa60Classification")  #!!The first object of this function needs to be changed to wherever the `FinalSummary.xlxs` file is in your working directory

After looking at the structure of our data frame “My_Data”, a new data frame, containing only the column which will be of our interest in the creation of the Sunburst plot, is created.

The custom function “Function_Gather” create above is used. This allows to only select the column 193_S193_L001_R1_001.fastq.gzto the column273_S273_L001_R1_001.fastq.gz (includes all the fasta.gz column of the FinalSummary.xlxs file) and consider them as ‘Read’ values. This function gives a new data frame “New_Data” grouped by taxonomy_value with the number of reads for each of these values.

New_Data <- Function_Gather(`193_S193_L001_R1_001.fastq.gz`:`273_S273_L001_R1_001.fastq.gz`, My_Data)

It is then needed to define NA as a string character “Unknown” in the Taxonomy_Value column. This will be needed for the next step of rearranging the data is a custom order. The data is a data frame thus, a named list giving the value, here “Unknown” is used to replace NA for each column using the replace_na() function.

str(New_Data) #Check if the the colunm of interest were corectly selected 
## Classes 'tbl_df', 'tbl' and 'data.frame':    17 obs. of  2 variables:
##  $ Taxonomy_Value: chr  "Chabertiidae" "Haemonchidae" "Haemonchus" "Haemonchus_placei" ...
##  $ Total         : num  4409 230883 47433 42715 18953 ...
New_Data[17,1]="NA"
New_Data %>% replace_na(list(x = 0, y = "NA"))
## # A tibble: 17 x 2
##    Taxonomy_Value                Total
##    <chr>                         <dbl>
##  1 Chabertiidae                   4409
##  2 Haemonchidae                 230883
##  3 Haemonchus                    47433
##  4 Haemonchus_placei             42715
##  5 Molineidae                    18953
##  6 Nematodirus                   18953
##  7 Nematodirus_helvetianus       18953
##  8 Oesophagostomum                4409
##  9 Oesophagostomum_venulosum      4409
## 10 Ostertagia                    43716
## 11 Spiculopteragia              139734
## 12 Spiculopteragia_boehmi        48385
## 13 Strongylida                  254245
## 14 unclassified_Haemonchus        4718
## 15 unclassified_Ostertagia       43716
## 16 unclassified_Spiculopteragia  91349
## 17 NA                            19368

This is the last step in rearranging our new data frame. Our data is now organized in two column, one for the taxonomy classification and one for the number of reads allocated to each level of the taxonomy classification. The last step is to arrange the data in the custom order needed for the organisation of the sunburst plot. To do so, the custom function “Function_Arrange” create above is used.

New_Data_Arrange <- Function_Arrange(New_Data, New_Data$Taxonomy_Value ,c("NA", "Strongylida", "Haemonchidae", "Chabertiidae", "Molineidae", "Spiculopteragia", "Ostertagia", "Haemonchus", "Oesophagostomum", "Nematodirus", "Spiculopteragia_boehmi", "unclassified_Spiculopteragia", "unclassified_Ostertagia", "Haemonchus_placei", "unclassified_Haemonchus", "Oesophagostomum_venulosum", "Nematodirus_belvetianus"))

Using the custom function “Function_DataFrame_Sb” created above, the ids, labels, parents and size are assigned to the sunburst plot.

DF_Sb1 <- Function_DataFrame_Sb(c(New_Data_Arrange$Taxonomy_Value),
          c("NA", "Strongylida", "Haemonchidae",  "Chabertiidae", "Molineidae", "Spiculopteragia", "Ostertagia", "Haemonchus", "Oesophagostomum", "Nematodirus", "Spiculopteragia<br>boehmi", "unclassified<br>Spiculopteragia", "unclassified<br>Ostertagia", "Haemonchus<br>placei", "unclassified<br>Haemonchus", "Oesophagostomum<br>venulosum", "Nematodirus<br>belvetianus"),
          c("","", "Strongylida","Strongylida","Strongylida", "Haemonchidae", "Haemonchidae", "Haemonchidae","Chabertiidae", "Molineidae", "Spiculopteragia", "Spiculopteragia", "Ostertagia", "Haemonchus", "Haemonchus","Oesophagostomum", "Nematodirus"),
          New_Data_Arrange$Total)

Finally, using the custom function “Function_Sb” created above, the sunburst plot of the “NUMBERS OF READS PER LEVEL OF TAXONOMY WITHIN YOUR SAMPLE” will be created. It is an interactive plot, thus if you over each section, it will inform you on the number of reads for each taxa. In addition, by clicking on a specific taxa you can zoom the sunburst on that specific taxa.

Sb1 <- Function_Sb(DF_Sb1,New_Data_Arrange$Total)
Sb1 ##To visualize our sunburst plot 

Variables To Be Changed

!The bold part are to be changed in your code (TO CHANGE)!



  • My_Data: You will need to change the directory to where your FinalSummary.xlsx excel file obtained after running the Nemabiome pipeline
  My_Data <- read.xlsx(TO CHANGE, sheet = "IdTaxa60Classification") 


  • New_Data: You will need to change the first function variable with your fasta.gz column of interest.
  New_Data <- Function_Gather(TO CHANGE, My_Data)


  • You will need to change the row number to the number where the NA row is in you data
  New_Data[TO CHANGE,1]="NA"


  • New_Data_Arrange: You will need to change the last function variable with the Classification name in the taxomic order of choice as seen in the Labels “Sunburst table” tab of the walkthrough.
  New_Data_Arrange <- Function_Arrange(New_Data, New_Data$Taxonomy_Value ,c(TO CHANGE LABEL)


  • DF_Sb1: You will have to change the second and third function variable with the correct 1. label and 2. parent order according to the classifiction chosen (should match the label and parent order from your “Sunburst table”)
DF_Sb1 <- Function_DataFrame_Sb(c(New_Data_Arrange&Taxonomy_Value),
          c(TO CHANGE LABEL),
          c(TO CHANGE PARENT),
          New_Data_Arrange$Total)


Example